language for working with structured, semi-structured, and unstructured data;
Kite: A set of libraries, tools, examples, and documentation that makes it easier to build systems on top of the Hadoop ecosystem;
Metamarkets Druid: Real-time analytics framework for large data sets;
Onyx: Distributed computation for the cloud;
Pinterest PinLater: Asynchronous task execution system;
Pydoop: Python MapReduce and HDFS APIs for Hadoop (see the sketch after this list);
Rackerlabs Blueflood: Multi-tenant distributed metric processing system
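As an example of the list's Python entry, Pydoop exposes HDFS through a small Python API. A minimal sketch, assuming a reachable HDFS deployment; the paths here are hypothetical:

import pydoop.hdfs as hdfs

# List a directory straight out of HDFS (hypothetical path).
for entry in hdfs.ls("/user/example"):
    print(entry)

# Read a text file from HDFS line by line.
with hdfs.open("/user/example/events.txt", "rt") as f:
    for line in f:
        print(line.rstrip())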
Column-oriented storage format for partially nested data structures, indexing for fast filtering, real-time ingestion and querying, and a highly fault-tolerant distributed architecture. According to the official documentation, Druid has the following main characteristics:
Designed for analytics: Druid is built for exploratory analytics on OLAP workflows, supporting a variety of filters, aggregators, and query types;
Fast interactive queries: Druid's low-latency data ingestion architecture allows events to be queried within milliseconds of being created.
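To make these query characteristics concrete, here is a minimal sketch of a timeseries query against a Druid broker over Druid's native JSON-over-HTTP API. The broker URL, datasource name, and field names are assumptions for illustration; adjust them to your deployment:

import requests

BROKER = "http://localhost:8082/druid/v2/"  # assumed default broker endpoint

query = {
    "queryType": "timeseries",
    "dataSource": "events",                  # hypothetical datasource
    "granularity": "minute",
    "intervals": ["2014-01-01T00:00:00/2014-01-02T00:00:00"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "doubleSum", "fieldName": "value", "name": "value_sum"},
    ],
}

resp = requests.post(BROKER, json=query)
resp.raise_for_status()
# The broker returns one bucket per granularity interval.
for bucket in resp.json():
    print(bucket["timestamp"], bucket["result"])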
AddThis: Kafka is used at AddThis to collect events generated by our data network and broker that data to our analytics clusters and real-time web analytics platform.
Urban Airship: At Urban Airship we use Kafka to buffer incoming data points from mobile devices for processing by our analytics infrastructure.
Metamarkets: We use Kafka to collect real-time event data from clients, as well as our own internal service metrics, to feed our interactive analytics dashboards.
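The common thread in these deployments is producers appending events to a Kafka topic that downstream analytics systems consume at their own pace, so Kafka acts as the buffer. A minimal sketch using the third-party kafka-python client; the broker address, topic name, and payload are hypothetical:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each incoming data point is appended to a topic; consumers read it later.
producer.send("mobile-datapoints", {"device": "abc123", "metric": "opens", "value": 1})
producer.flush()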
In particular, streaming algorithms (e.g., streaming k-means) allow Spark to facilitate decision-making in real time. Companies using Spark include Amazon, Yahoo, NASA JPL, eBay, and Baidu, among others.
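As a rough illustration of that point, here is a minimal streaming k-means sketch using Spark Streaming's Python API; the socket source, dimensionality, and parameter values are assumptions for illustration:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="StreamingKMeansSketch")
ssc = StreamingContext(sc, batchDuration=5)

# Training stream: lines of comma-separated features from a socket
# (host and port are hypothetical).
train = (ssc.socketTextStream("localhost", 9999)
            .map(lambda line: Vectors.dense([float(x) for x in line.split(",")])))

# Cluster centers update incrementally as each micro-batch arrives.
model = StreamingKMeans(k=3, decayFactor=1.0).setRandomCenters(dim=2, weight=0.0, seed=42)
model.trainOn(train)
model.predictOn(train).pprint()

ssc.start()
ssc.awaitTermination()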
If you have a large amount of state to work with (for example, many gigabytes per partition), you can choose Samza. Because Samza co-locates storage and processing on the same machine, it lets you work with that state efficiently without flooding memory. The framework offers a flexible pluggable API: its default execution, messaging, and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data-processing stages owned by different teams with different codebases, Samza's fine-grained jobs are particularly well-suited, since they can be added or removed with minimal ripple effects. A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale...

Conclusion

We only scratched the surface of the three Apaches. We didn't cover...
Interesting readings
Big Data Benchmark: benchmark of Redshift, Hive, Shark, Impala, and Stinger/Tez.
NoSQL Comparison: Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs Elasticsearch vs Accumulo vs VoltDB vs Scalaris.
Interesting Papers (2013–2014)
2014, Stanford: Mining of Massive Datasets.
2013, AMPLab: Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
2013, AMPLab: MLbase: A Distributed Machine-Learning System.
Netflix recently open-sourced a tool called Suro, which the company uses to route data in real time from source hosts to target hosts. Not only does it play an important role in Netflix's data pipeline, it is also impressive for the scale at which it runs. Netflix's various applications generate tens of billions of events per day. Suro collects them at the source; part of the data is then sent via Amazon S3 to Hadoop for batch processing, while another part goes via Apache Kafka to Druid and Elasticsearch for real-time analysis.
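Suro itself is a Java system and its routing logic is not reproduced here; the following is only a minimal, hypothetical sketch of the dual-path idea the paragraph describes, with stand-in sink classes for the S3-to-Hadoop and Kafka-to-Druid/Elasticsearch paths:

class BatchSink:
    """Stand-in for the S3 -> Hadoop batch path."""
    def write(self, event):
        print("batch:", event)

class RealtimeSink:
    """Stand-in for the Kafka -> Druid/Elasticsearch real-time path."""
    def write(self, event):
        print("realtime:", event)

def route(events, sinks):
    """Fan each incoming event out to every configured sink."""
    for event in events:
        for sink in sinks:
            sink.write(event)

# Every event is demultiplexed to both the batch and real-time paths.
route([{"type": "play", "ts": 1}], [BatchSink(), RealtimeSink()])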